feat(benchmark): add benchmark command with pipeline metrics and PR analysis#14
Merged
Add a new `swe benchmark` CLI subcommand that runs N candidate PRs through the full mining pipeline and outputs detailed metrics as JSON. The benchmark exercises the complete flow: GH Archive ingestion → enrichment → filtering → LLM classification → patch extraction → Docker-based agentic test generation → quality scoring → export.

Code changes in src/cli/commands.rs:

- Added `SweBenchmarkArgs` struct with configurable parameters (count, min-stars, languages, model, api-key, cache-db, output directory)
- Added a `Benchmark` variant to the `SweSubcommand` enum
- Implemented the `run_swe_benchmark_command` async handler that validates API keys, configures `SweOrchestrator`, runs the pipeline, and outputs JSON results

README.md updated with comprehensive English benchmark results from a run processing 100 PRs (2026-02-17), including:

- Pipeline funnel (1.75M raw events → 8 accepted tasks, 0.00046% yield)
- Difficulty distribution (81.8% medium, 18.2% easy, 0% hard)
- Quality metrics (avg 0.47, pass rate 72.7%, threshold ≥0.30)
- Throughput/timing (21 PRs extracted/hr, 8 accepted/hr, 171.4s avg per PR)
- Language distribution (Go 37.5%, Java 25%, Python 25%, TypeScript 12.5%)
- Accepted task listing with scores
- Test generation failure analysis
- Usage instructions for running the benchmark

Benchmark artifacts added:

- benchmark-output/ with 8 accepted task directories, each containing workspace.yaml, checks.txt, prompt.md, original_pr.md, and test scripts
- benchmark_output.json and benchmark_results.json with raw pipeline output
- benchmark_clean.log with the pipeline execution log
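The argument surface and subcommand wiring described above can be sketched in plain Rust. This is a hypothetical, dependency-free sketch: the real code in src/cli/commands.rs presumably derives these types with a CLI crate such as clap, and every field type and default below (other than the listed parameter names and the default count of 100 stated in the summary) is an assumption.

```rust
// Hypothetical sketch of the benchmark CLI types described above.
// Field types and defaults are assumptions; only the parameter names
// and the default batch size of 100 come from the PR description.

#[allow(dead_code)]
#[derive(Debug, Clone)]
struct SweBenchmarkArgs {
    count: usize,            // number of candidate PRs to run (default 100)
    min_stars: u32,          // minimum repo-stars filter
    languages: Vec<String>,  // language allow-list
    model: String,           // LLM model used for classification
    api_key: Option<String>, // validated before the pipeline starts
    cache_db: String,        // path to the cache database
    output_dir: String,      // where JSON results and task dirs are written
}

impl Default for SweBenchmarkArgs {
    fn default() -> Self {
        Self {
            count: 100, // the PR summary states 100 as the default
            min_stars: 0,
            languages: Vec::new(),
            model: String::new(),
            api_key: None,
            cache_db: String::from("cache.db"),
            output_dir: String::from("benchmark-output"),
        }
    }
}

#[allow(dead_code)]
#[derive(Debug)]
enum SweSubcommand {
    Benchmark(SweBenchmarkArgs),
    // ...other existing subcommands elided
}

fn main() {
    let cmd = SweSubcommand::Benchmark(SweBenchmarkArgs::default());
    if let SweSubcommand::Benchmark(a) = cmd {
        println!("count={} output_dir={}", a.count, a.output_dir);
    }
}
```

In the real handler, these parsed arguments would be handed to `SweOrchestrator` before the pipeline run; the exact constructor is not shown in this PR text.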
echobt added a commit that referenced this pull request on Apr 8, 2026
…nalysis (#14)

* feat(swe): add BenchmarkMetrics tracking to pipeline and orchestrator
* feat(benchmark): add benchmark command and pipeline metrics documentation
* ci: trigger CI run
Summary

Add a new `benchmark` CLI command that evaluates the SWE pipeline against a batch of PRs, collecting detailed metrics on filtering, difficulty distribution, quality, and throughput. Results are persisted as JSON and documented in the README.

Changes

- `benchmark` subcommand (src/cli/commands.rs): runs the pipeline on a configurable number of PRs (default 100) and outputs structured results
- `BenchmarkMetrics` tracking (src/swe/pipeline.rs, src/swe/orchestrator.rs): instruments the pipeline to capture per-PR timing, filtering decisions, and difficulty classification during benchmark runs
- Benchmark artifacts (benchmark-output/, benchmark_results.json, benchmark_output.json): sample benchmark results across 8 repositories covering Go, Java, Python, and Rust projects

Notes

- `benchmark` subcommand
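The headline metrics the benchmark reports (funnel yield, difficulty split) reduce to simple ratio arithmetic over the tracked counters. As an illustration, here is a hypothetical, minimal `BenchmarkMetrics` aggregator reproducing the numbers quoted in this PR; the real struct in src/swe/pipeline.rs tracks far more per-PR detail, and the counts of 9 medium / 2 easy are an inference from the quoted 81.8%/18.2% split and 72.7% pass rate (implying 11 classified PRs), not figures stated directly in the source.

```rust
// Hypothetical aggregator mirroring the funnel numbers quoted above.
// The real BenchmarkMetrics also records per-PR timing and filtering
// decisions; this sketch covers only the ratio arithmetic.

#[derive(Debug)]
struct BenchmarkMetrics {
    raw_events: u64,     // GH Archive events ingested
    accepted_tasks: u64, // tasks that passed quality scoring
    medium: u64,
    easy: u64,
    hard: u64,
}

impl BenchmarkMetrics {
    /// End-to-end funnel yield, as a percentage of raw events.
    fn yield_pct(&self) -> f64 {
        100.0 * self.accepted_tasks as f64 / self.raw_events as f64
    }

    /// Share of classified PRs rated medium, as a percentage.
    fn medium_pct(&self) -> f64 {
        let total = (self.medium + self.easy + self.hard) as f64;
        100.0 * self.medium as f64 / total
    }
}

fn main() {
    // Figures from the benchmark run documented in the README (2026-02-17).
    let m = BenchmarkMetrics {
        raw_events: 1_750_000,
        accepted_tasks: 8,
        medium: 9, // inferred: 81.8% of 11 classified PRs
        easy: 2,   // inferred: 18.2% of 11 classified PRs
        hard: 0,
    };
    println!("yield = {:.5}%", m.yield_pct());   // prints: yield = 0.00046%
    println!("medium = {:.1}%", m.medium_pct()); // prints: medium = 81.8%
}
```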